Abstract.

The study of community structure has been a hot topic of research over the last
years. But, while successfully applied in several areas, the concept lacks a general
and precise notion. Facts like the hierarchical structure and heterogeneity of complex
networks make it difficult to unify the idea of community and its evaluation. The
global functional known as modularity is probably the most used technique in this
area. Nevertheless, its limits have been deeply studied. Local techniques such as the ones
by Lancichinetti et al. and Palla et al. arose as an answer to the resolution limit and
degeneracies that modularity has.

Here we start from the algorithm by Lancichinetti et al. and propose a unique
growth process for a fitness function that, while being local, finds a community partition
that covers the whole network, updating the scale parameter dynamically. We test the
quality of our results by using a set of benchmarks of heterogeneous graphs. We discuss
alternative measures for evaluating the community structure and, in the light of them,
infer possible explanations for the better performance of local methods compared to
global ones in these cases.

1. Introduction

In recent years community detection has become one of the top research topics in the area
of Complex Networks. Due in part to the explosion of social networking, but also to
its application in areas as diverse as ecology and computational biology, an interest arose
in defining, detecting, evaluating and comparing community structures. For a thorough
(yet not exhaustive) reference of its applications see the survey by [Fortunato, 2010].

The early research by Newman started from the use of betweenness to divide
the network into modules [Girvan and Newman, 2002], and the definition of modularity
to evaluate communities [Newman and Girvan, 2004]. Then he proposed using the
modularity as a functional to be maximized [Newman, 2006]. Different optimization
techniques were developed, of which we recall the algorithm by Guimera based on
simulated annealing [Guimera and Nunes Amaral, 2005] for its good results, and the
Louvain algorithm [Blondel et al., 2008] for its fast convergence on large networks.

Later, the works by [Good et al., 2010] and [Fortunato and Barthelemy, 2007]
questioned the global optimization methods based on modularity, for being prone
to resolution limits and extreme degeneracies. Local techniques were proposed,
such as the Clique Percolation Method (CPM) in [Palla et al., 2005], and the algorithm
in [Lancichinetti et al., 2009], based on a fitness function. Both of them find overlapping
communities, and in the latter a different notion of community, the natural community,
arose. The natural community of a vertex is a locally-computed set, and its size depends
on a resolution parameter $\alpha$.

It has also been observed that the resolution limits for modularity
found in [Fortunato and Barthelemy, 2007] are particularly common in heterogeneous
graphs with heavy-tailed community sizes and vertex degree distributions
(see [Fortunato, 2010], section VI.C). In these graphs, small communities will often
be masked into larger ones by modularity maximization techniques when they are
interconnected by just a few links.

In order to detect the communities we define a fitness function following the ideas
in [Lancichinetti et al., 2009]. After analyzing the role of the resolution parameter $\alpha$
in these functions, we propose a uniform fitness growth process which scans the whole
graph and whose parameter is updated dynamically. Then, we extract a community
partition from the output of this process. The details of our method are described in
sections 2 and 3, and the algorithmic complexity is discussed in section 4.

In section 5 we use a benchmark developed in [Lancichinetti et al., 2008] to build
a dataset of heterogeneous networks. The results that we obtained show an important
improvement using our fitness growth process when compared to the global modularity
maximization techniques, which suggests that local methods may outperform global
ones in these cases. In order to discuss this conjecture, we propose a correlation-based
measure of community structure and use it to visualize the differences in performance
between the two methods, giving a possible explanation.

As a measure for comparing community structures, [Danon et al., 2005] proposed
using the normalized mutual information. We shall use it in order to make comparisons
with global methods and with community structures known a priori. We also apply
the algorithm to real networks and show the results. Finally, we discuss the robustness
(repeatability of the results) of our process.

2. Our method 

[Lancichinetti et al., 2009] defines a process based on a fitness function with a resolution
parameter $\alpha$ such that, given a set $C \subset V$:

$$f(C) = \frac{k_{in}}{(k_{in} + k_{out})^{\alpha}} ,$$

where $k_{in}$ is the total internal degree of $C$ (twice the number of edges that join
vertices in $C$), and $k_{out}$ is the number of edges
that join some vertex in $C$ to some vertex not in $C$. Applying this process to any vertex
$v$, the natural community of $v$ is obtained. In some way, the resolution parameter $\alpha$ is
related to the natural community size.

Starting with a community made up of the seed vertex $v$, their algorithm proceeds
by stages, where in each stage the steps are: 1) select a vertex whose addition increases
the fitness function, and add it to the current community; 2) delete from the current
community every vertex whose deletion increases the fitness function.

The algorithm stops when, being in step 1, it finds no vertex to add. Step 2 is
time-consuming, and usually very few vertices are deleted, but it is necessary due to the
local, vertex-by-vertex nature of the analysis. The authors called the final result of the
algorithm the natural community associated to $v$.

In order to obtain a covering by overlapping communities, they select a vertex at
random, obtain its natural community, select a vertex not yet covered at random, obtain
its natural community, and so on until they cover the whole graph.

In all this process, the resolution parameter $\alpha$ of the fitness function is kept fixed.
The authors perform an analysis in order to find the significant values of $\alpha$.

Our contribution extends that work to define a uniform growth process. This process
covers the whole graph by making a course throughout its communities. We modify the
fitness function $f(C)$ and analyze the role of $\alpha$ in the termination criteria for the process.
Then we propose an algorithm for increasing the fitness function monotonically while
traversing the graph, dynamically updating the parameter. Finally, a cutting technique
divides the sequence of vertices obtained by the process, in order to get a partition into
communities.

2.1. Previous definitions 

We shall deal with simple undirected graphs $G = (V,E)$, with $n = |V|$ vertices and $m$
edges (here $|\cdot|$ denotes the cardinal of a set). To avoid unnecessary details, we assume
that $E \subset V \times V$ is such that $(v,w) \in E$ implies that $(w,v) \in E$.

We set $\delta_E(v,w) = 1$ if $(v,w) \in E$, and $\delta_E(v,w) = 0$ otherwise. We then have
the following expression for the degree of a vertex $v$:

$$\deg(v) = \sum_{w \in V} \delta_E(v,w) .$$

Thus, $|E| = \sum_{v \in V} \deg(v) = 2m$. We shall use two measures, $m_V$ and $m_E$, the first one
on $V$ and the second one on $V \times V$. Given $C \subset V$,

$$m_V(C) = \sum_{v \in C} \deg(v) / |E|$$

is the normalized sum of the degrees of the vertices in $C$. Given $D \subset V \times V$,

$$m_E(D) = \sum_{(v,w) \in D} \delta_E(v,w) / |E| .$$

Notice that when $C_1, C_2 \subset V$ are mutually disjoint, $m_E(C_1 \times C_2)$ is the normalized cut
between $C_1$ and $C_2$. The $cut(C_1, C_2)$ is, in this case, the set of pairs $(v,w) \in E$ such
that $v \in C_1$ and $w \in C_2$. Notice also that $m_V$ is the marginal measure of $m_E$, and
that these measures are in fact probabilities. For $C \subset V$, we shall denote for simplicity
$m_E(C) = m_E(C \times \bar{C})$, where $\bar{C} = V \setminus C$.

Let $C \subset V$, and $v \in V$. We denote

$$ki_C(v) = \sum_{w \in C} \delta_E(v,w)$$

and

$$ko_C(v) = \sum_{w \notin C} \delta_E(v,w) .$$

Thus $ki_C(v)$ is the number of vertices in $C$ joined to $v$, and $ko_C(v)$ is the number of
vertices not in $C$ joined to $v$; of course $ki_C(v) + ko_C(v) = \deg(v)$.

We shall also use $ski(C) = \sum_{v \in C} ki_C(v)$, and $sko(C) = \sum_{v \in C} ko_C(v)$.
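
As an illustration, the definitions above can be evaluated directly on a small graph. The sketch below assumes an adjacency representation as a dictionary of neighbor sets; the function names and the toy graph `ADJ` (two triangles joined by one edge) are hypothetical and only serve the example.

```python
def degree(adj, v):
    """deg(v): number of neighbours of v."""
    return len(adj[v])

def ki(adj, C, v):
    """ki_C(v): neighbours of v that lie inside C."""
    return len(adj[v] & C)

def ko(adj, C, v):
    """ko_C(v): neighbours of v that lie outside C."""
    return len(adj[v] - C)

def measures(adj, C):
    """Return (m_V(C), m_E(C), ski(C), sko(C)) for a vertex set C."""
    size_E = sum(degree(adj, v) for v in adj)   # |E| = 2m (ordered pairs)
    ski = sum(ki(adj, C, v) for v in C)         # internal pairs, counted twice
    sko = sum(ko(adj, C, v) for v in C)         # pairs leaving C
    m_V = (ski + sko) / size_E                  # normalized degree sum
    m_E = sko / size_E                          # normalized cut of C
    return m_V, m_E, ski, sko

# Hypothetical toy graph: triangles {0,1,2} and {3,4,5} joined by edge 2-3.
ADJ = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3},
       3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}

m_V, m_E, ski_C, sko_C = measures(ADJ, {0, 1, 2})
```

For the first triangle this gives $ski = 6$, $sko = 1$, and therefore $m_V = 7/14$ and $m_E = 1/14$.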

2.2. A growth process 

Consider a fitness function $f$, associating to each $C \subset V$ a real number $f(C)$.

Given $v \in V$, we shall consider a growth process for $f$ with seed $v$: it consists of a
double sequence

$$D_{00}, D_{10}, \dots, D_{1k_1}, \dots, D_{a0}, \dots, D_{ak_a}, \dots, D_{b0}, \dots, D_{bk_b}$$

of subsets of $V$. Thus, for each $a$ such that $0 \le a \le b$, we have a subsequence
$D_{a0}, \dots, D_{ak_a}$ ($a, b \in \mathbb{N}$).

• $D_{00} = \{v\}$, $k_0 = 0$.

• For $a \ge 0$, $D_{(a+1)0} = D_{ak_a}$ and $D_{(a+1)1}$ is obtained from $D_{(a+1)0}$ by adding to it one
vertex such that $f(D_{(a+1)1}) > f(D_{(a+1)0})$.

• For $k \ge 1$, $D_{a(k+1)}$ is obtained from $D_{ak}$ by elimination of a vertex (different from
the seed vertex $v$), such that $f(D_{a(k+1)}) > f(D_{ak})$.

In addition, we assume that for each $a > 0$, there is no vertex $w \in D_{ak_a}$ such that its
elimination induces an increase in $f$, and that there is no vertex outside $D_{bk_b}$ whose
addition induces an increase in $f$. Alternatively, we may describe the process by
$v + s_1 w_1 + \dots + s_r w_r$, where the signs $s_j$ (1 or $-1$) determine whether the vertex $w_j$
is added or eliminated in this step. For example, $v + w_1 + w_2 + w_3 + w_4 - w_5 + w_6$ means
that in the first four steps we added $w_1, w_2, w_3, w_4$, in the fifth step we eliminated $w_5$
(which of course must be equal to one of the previously added vertices) and in the
sixth step we added $w_6$.

2.3. Concrete cases 

For $C \subset V$, consider $m_V(C)$, $m_E(C)$, which we shall abbreviate $m_V$, $m_E$ when there is
no room for ambiguity. Recall that $m_V$ is the normalized sum of the degrees of the
vertices in $C$, and $m_E$ is the normalized cut defined by $C$.

We shall deal with two parametric families of fitness functions, with a real parameter
$t > 0$:

$$L_t = \frac{m_V - m_E}{m_V^{1/t}}$$

and

$$H_t = m_V (1 - m_V/2t) - m_E .$$

The first of these families is equivalent to the one used by the authors in
[Lancichinetti et al., 2009], with $\alpha = 1/t$.
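
The equivalence between $L_t$ and the fitness function of [Lancichinetti et al., 2009] can be checked numerically. The sketch below assumes $L_t = (m_V - m_E)/m_V^{1/t}$, takes $k_{in} = ski(C)$ and $k_{out} = sko(C)$, and uses a hypothetical toy graph `ADJ`; under these assumptions the two functions differ only by the constant factor $|E|^{1-\alpha}$, so they rank vertex sets identically.

```python
def measures(adj, C):
    """Return (m_V, m_E, ski, sko, |E|) for the vertex set C."""
    size_E = sum(len(nb) for nb in adj.values())
    ski = sum(len(adj[v] & C) for v in C)
    sko = sum(len(adj[v] - C) for v in C)
    return (ski + sko) / size_E, sko / size_E, ski, sko, size_E

def fitness_L(m_V, m_E, t):
    """L_t = (m_V - m_E) / m_V^(1/t)."""
    return (m_V - m_E) / m_V ** (1.0 / t)

def fitness_H(m_V, m_E, t):
    """H_t = m_V (1 - m_V / 2t) - m_E."""
    return m_V * (1.0 - m_V / (2.0 * t)) - m_E

def fitness_lfk(k_in, k_out, alpha):
    """f(C) = k_in / (k_in + k_out)^alpha, with k_in the internal degree sum."""
    return k_in / (k_in + k_out) ** alpha

# Hypothetical toy graph: triangles {0,1,2} and {3,4,5} joined by edge 2-3.
ADJ = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3},
       3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}

t = 2.0
ratios = []
for C in ({0, 1}, {0, 1, 2}, {0, 1, 2, 3}):
    m_V, m_E, ski, sko, size_E = measures(ADJ, C)
    # With alpha = 1/t the ratio f / L_t is the same for every set C.
    ratios.append(fitness_lfk(ski, sko, 1.0 / t) / fitness_L(m_V, m_E, t))
```

Every entry of `ratios` equals $|E|^{1 - 1/t} = \sqrt{14}$ up to rounding, for the three sets tried here.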

2.4. A differential analysis

Let $C \subset V$, and $w \in V$. Suppose that we are to add $w$ to $C$, if $w \notin C$, or to eliminate
$w$ from $C$, if $w \in C$, obtaining in either case a new set $C' = C \pm w$. Let us denote
$\Delta m_V = m_V(C') - m_V(C)$, $\Delta m_E = m_E(C') - m_E(C)$, and let $s, t > 0$ be two fixed values of
the parameter. Then we have the following approximate expression for the difference
quotient of $L_t$,

$$L'_t = \frac{\Delta L_t}{\Delta m_V} \approx \frac{1}{m_V^{1/t}} \left( 1 - \frac{\Delta m_E}{\Delta m_V} - \frac{L_1}{t} \right) ,$$

where $L_1 = (m_V - m_E)/m_V$ is $L_t$ for $t = 1$. For the difference quotient of $H_t$ we obtain

$$H'_t = \frac{\Delta H_t}{\Delta m_V} \approx 1 - \frac{\Delta m_E}{\Delta m_V} - \frac{m_V}{t} .$$

Notice then the following relations:

$$H'_t = H'_s + \frac{t - s}{ts} m_V , \quad (1)$$

$$m_V^{1/t} L'_t = m_V^{1/s} L'_s + \frac{t - s}{ts} L_1 , \quad (2)$$

$$H'_t = m_V^{1/t} L'_t + (L_1 - m_V)/t . \quad (3)$$

Equation 1 shows us that if $t > s$ and $H'_s > 0$, then $H'_t > 0$, which means that if
the vertex $w$ is a candidate for addition to $C$ (elimination from $C$) for the $H_s$ process,
it is also a candidate for addition (elimination) for the $H_t$ process.

Equation 2 shows us analogously that if $t > s$ and $L'_s > 0$, then $L'_t > 0$, which
means that if the vertex $w$ is a candidate for addition to $C$ (elimination from $C$) for
the $L_s$ process, it is also a candidate for addition (elimination) for the $L_t$ process.

This shows that the parameter $t$ does not play an essential role during the growth
process for $H_t$ or $L_t$, but merely establishes the termination criterion.

Equation 3 shows a delicate fact: if a vertex $w$ is a candidate for addition
(elimination) for the $L_t$ process, and $m_V < L_1$, then it is a candidate for addition
(elimination) for the $H_t$ process. The condition $m_V < L_1$ is usually true: notice that when
$m_V > L_1$, we have $m_E > m_V(1 - m_V)$, which contradicts the notion of community, because
the second term would be the mean of the first one if the vertices were to be selected
randomly. Thus, both
processes are essentially equivalent, their difference lying in the termination criterion. In
exceptional cases, communities obtained with the $H_t$ fitness functions are bigger than
those obtained with the $L_t$ fitness functions.

Of course, there are approximations involved, so that our previous comments are
rough and qualitative; our experience testing both fitness functions confirms them.



2.5. Natural communities 

The following is a formalization of the procedure described in [Lancichinetti et al., 2009] 
to obtain the natural community of a vertex v, generalized for any fitness function. 

Algorithm 1: Natural communities

Input: A graph $G = (V,E)$, a fitness function $f$, a vertex $v \in V$
Output: A growth process $D_{00}, D_{10}, \dots, D_{a0}, \dots, D_{ak_a}, \dots, D_{b0}, \dots, D_{bk_b}$
begin
    $D_{00} = \{v\}$
    $m = 0$
    while there exists $w$ out of $D_{m0}$ such that $f(D_{m0} + w) > f(D_{m0})$ do
        $D_{m1} = D_{m0} + w$
        $k = 1$
        while there exists $w \in D_{mk}$, $w \neq v$ : $f(D_{mk} - w) > f(D_{mk})$ do
            $D_{m(k+1)} = D_{mk} - w$
            $k = k + 1$
        end
        $D_{(m+1)0} = D_{mk}$
        $m = m + 1$
    end
end

The output of this "algorithm" is a growth process for $f$, $v + w_1 + w_2 \pm w_3 \pm
\dots \pm w_{r-1} + w_r$, such that there is no $w$ not in $D_{r0}$ with $f(D_{r0} + w) > f(D_{r0})$. Each
$D_{j0}$, $0 < j < k$, satisfies that there is no $w \in D_{j0}$, $w \neq v$, such that $f(D_{j0} - w) > f(D_{j0})$.
$D_{r0}$ is a possible "natural community" with seed $v$.

Remark: Notice that the preceding prescription is not complete, because both the
$w$ that we choose to add, as well as the $w$ that we choose to eliminate, depend upon a
criterion that we do not fix.
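
A minimal Python sketch of Algorithm 1 follows. Since the selection criterion is left open, the sketch makes two assumptions of its own: it greedily picks the vertex with the largest fitness gain, and it only tries vertices adjacent to the current set. The graph `ADJ`, the helper `fitness_H` and the chosen values of $t$ are hypothetical.

```python
def fitness_H(adj, C, t):
    """H_t(C) = m_V (1 - m_V / 2t) - m_E, computed from the adjacency sets."""
    size_E = sum(len(nb) for nb in adj.values())
    m_V = sum(len(adj[v]) for v in C) / size_E
    m_E = sum(len(adj[v] - C) for v in C) / size_E
    return m_V * (1.0 - m_V / (2.0 * t)) - m_E

def natural_community(adj, seed, fitness):
    """Grow a set from `seed`; returns the set and the signed step list."""
    C = {seed}
    steps = [(+1, seed)]
    while True:
        # Addition: try only neighbours of C (an assumption of this sketch).
        frontier = sorted(set().union(*(adj[v] for v in C)) - C)
        current = fitness(C)
        gains = [(fitness(C | {w}) - current, w) for w in frontier]
        if not gains or max(gains)[0] <= 0:
            break                      # no vertex increases the fitness
        _, best = max(gains)           # greedy choice (an assumption)
        C.add(best)
        steps.append((+1, best))
        # Eliminations: drop vertices (never the seed) while fitness increases.
        while True:
            current = fitness(C)
            drops = [(fitness(C - {w}) - current, w)
                     for w in sorted(C) if w != seed]
            if not drops or max(drops)[0] <= 0:
                break
            _, worst = max(drops)
            C.remove(worst)
            steps.append((-1, worst))
    return C, steps

# Hypothetical toy graph: triangles {0,1,2} and {3,4,5} joined by edge 2-3.
ADJ = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3},
       3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}

small, small_steps = natural_community(ADJ, 0, lambda C: fitness_H(ADJ, C, 0.5))
large, _ = natural_community(ADJ, 0, lambda C: fitness_H(ADJ, C, 2.0))
```

On this graph the process with $t = 0.5$ stops at the first triangle, while $t = 2$ lets it continue over the bridge and cover both triangles. Every step strictly increases the fitness, so the sketch cannot revisit a set and always terminates.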

2.6. Uniform growth processes 

In the previous section we have described a method to obtain a natural community
with seed $v$ and fitness function $f$. Applying this with $f = H_t$ and fixed $t$, for different
values of $t$ we obtain different communities. Although it is not strictly true that "the
bigger the $t$, the bigger the community", we have noticed in our differential analysis
that this is essentially the case. Thus, it is reasonable to wonder whether it is possible
to obtain all these communities with a unique process, starting with the smallest ones
and proceeding with the biggest ones. The answer is affirmative, as we shall see now.

Let us assume that we have our parametric family of fitness functions $H_t$, $0 < t$.
Given $C$ and $w \in V$ such that $ki_C(w) > 0$, there always exists $t_c = t_c(C,w) > 0$ such
that $H_{t_c}(C + w) = H_{t_c}(C)$. Indeed, we have:

$$H_t(C + w) = (m_V + \Delta m_V)(1 - (m_V + \Delta m_V)/2t) - (m_E + \Delta m_E)$$

$$= m_V(1 - m_V/2t) - m_E - \frac{\Delta m_V}{t}(m_V + \Delta m_V/2) + \Delta m_V - \Delta m_E$$

$$= H_t(C) - \frac{\Delta m_V}{t}(m_V + \Delta m_V/2) + \Delta m_V - \Delta m_E$$

and it follows that

$$t_c = \frac{\Delta m_V (m_V + \Delta m_V/2)}{\Delta m_V - \Delta m_E}$$

satisfies our requirements. We also see that

$$\Delta H_t = -\frac{\Delta m_V}{t}(m_V + \Delta m_V/2) + \Delta m_V - \Delta m_E$$

and it follows that $\Delta H_t > 0$ when $t > t_c$ and $w \notin C$, and that $\Delta H_t > 0$ when $t < t_c$
and $w \in C$.
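
The critical value $t_c$ can be computed exactly with rational arithmetic. The following sketch assumes the expression $t_c = \Delta m_V (m_V + \Delta m_V/2)/(\Delta m_V - \Delta m_E)$, with $\Delta m_V = \pm \deg(w)/|E|$ and $\Delta m_E = \pm (ko_C(w) - ki_C(w))/|E|$ (plus sign for an addition, minus sign for an elimination). Returning infinity when $ki_C(w) = 0$ is a convention of the sketch, and the graph `ADJ` is hypothetical.

```python
from fractions import Fraction
import math

def critical_t(adj, C, w):
    """t_c(C, w) for adding w (if w is outside C) or eliminating it (if inside)."""
    size_E = sum(len(nb) for nb in adj.values())
    ki = len(adj[w] & C)               # neighbours of w inside C
    ko = len(adj[w]) - ki              # neighbours of w outside C
    if ki == 0:
        return math.inf                # convention: denominator would be zero
    sign = -1 if w in C else 1
    m_V = Fraction(sum(len(adj[v]) for v in C), size_E)
    d_mV = Fraction(sign * len(adj[w]), size_E)
    d_mE = Fraction(sign * (ko - ki), size_E)
    return d_mV * (m_V + d_mV / 2) / (d_mV - d_mE)

def fitness_H(adj, C, t):
    """H_t(C), exact when t is a Fraction."""
    size_E = sum(len(nb) for nb in adj.values())
    m_V = Fraction(sum(len(adj[v]) for v in C), size_E)
    m_E = Fraction(sum(len(adj[v] - C) for v in C), size_E)
    return m_V * (1 - m_V / (2 * t)) - m_E

# Hypothetical toy graph: triangles {0,1,2} and {3,4,5} joined by edge 2-3.
ADJ = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3},
       3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}

t_add = critical_t(ADJ, {0, 1, 2}, 3)        # adding the bridge vertex
t_del = critical_t(ADJ, {0, 1, 2, 3}, 3)     # eliminating it again
```

Here `t_add` and `t_del` coincide, and at that value of $t$ the fitness of the triangle equals the fitness of the triangle plus the bridge vertex, which is the defining property of $t_c$.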

Let $v + \sum_{i=1}^{M} s_i w_i$ be an algebraic expression with the previously introduced
meaning, where of course we assume that each time that we eliminate a vertex, that
vertex had previously been added. Let $C_0 = v$ and for $r > 0$, $C_r = v + \sum_{i=1}^{r} s_i w_i$.
We assume that for each $r$, $0 \le r < M$, $ki_{C_r}(w_{r+1}) > 0$. We shall consider values
$0 = t_0, t_1, \dots, t_r$ associated to this expression, $t_r = \max\{t_{r-1}, t_c(C_{r-1}, w_r)\}$ when $s_r = 1$,
$t_r = t_{r-1} < t_c(C_{r-1}, w_r)$ when $s_r = -1$. Thus, $t_0, \dots, t_r$ is a non-decreasing sequence,
and $C_0, \dots, C_r$ is a growth process for $H_t$ if $t > t_r$. We call $C_0, \dots, C_M$ a uniform growth
process for $H$.

Algorithm 2: A growth process for H

Input: A graph $G = (V,E)$, a vertex $v \in V$
Output: A growth process for $H$: $D_{00}, D_{10}, \dots, D_{a0}, \dots, D_{ak_a}, \dots, D_{b0}, \dots, D_{bk_b}$
begin
    $D_{00} = \{v\}$
    $t_a = 0$
    $m = 0$
    while there exists $w$ not in $D_{m0}$ do
        let $w_a$ be such that $t_c(D_{m0}, w_a) = \min_{w \notin D_{m0}} (t_c(D_{m0}, w))$
        $t_a = \max\{t_a, t_c(D_{m0}, w_a)\}$
        $D_{m1} = D_{m0} + w_a$
        $k = 1$
        while there exists $w \in D_{mk}$, $w \neq v$ : $t_c(D_{mk}, w) > t_a$ do
            $D_{m(k+1)} = D_{mk} - w$
            $k = k + 1$
        end
        $D_{(m+1)0} = D_{mk}$
        $m = m + 1$
    end
end

The output of this "algorithm" is a uniform growth process for $H$, which ends
by covering the whole graph. The successive truncations of the sequence thus
obtained are natural communities for $v$ at different resolutions. In the sequel we
assume (with empirical evidence) that these natural communities are made up of small
subcommunities, which are inserted one after another during the growth process. The
following section explains how to detect these communities.
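
The uniform growth process can be sketched in Python as follows. This is an illustrative reading of Algorithm 2, not a reference implementation: exact rational arithmetic is used so that a vertex that has just been added is never eliminated through rounding, ties are broken by vertex order, the first eligible vertex is the one eliminated, and the process stops early on a disconnected graph. All of these choices, as well as the graph `ADJ`, are assumptions.

```python
from fractions import Fraction
import math

def critical_t(adj, C, w):
    """t_c(C, w); infinity when w has no neighbour in C (a convention)."""
    size_E = sum(len(nb) for nb in adj.values())
    ki = len(adj[w] & C)
    ko = len(adj[w]) - ki
    if ki == 0:
        return math.inf
    sign = -1 if w in C else 1
    m_V = Fraction(sum(len(adj[v]) for v in C), size_E)
    d_mV = Fraction(sign * len(adj[w]), size_E)
    d_mE = Fraction(sign * (ko - ki), size_E)
    return d_mV * (m_V + d_mV / 2) / (d_mV - d_mE)

def uniform_growth(adj, seed):
    """Return the signed step list and the final value of the parameter t_a."""
    C = {seed}
    t_a = Fraction(0)
    steps = [(+1, seed)]
    while len(C) < len(adj):
        outside = sorted(w for w in adj if w not in C)
        w_a = min(outside, key=lambda w: critical_t(adj, C, w))
        t_new = critical_t(adj, C, w_a)
        if t_new == math.inf:
            break                          # remaining vertices are unreachable
        t_a = max(t_a, t_new)              # the parameter never decreases
        C.add(w_a)
        steps.append((+1, w_a))
        while True:                        # eliminations at the current t_a
            removable = [w for w in sorted(C)
                         if w != seed and critical_t(adj, C, w) > t_a]
            if not removable:
                break
            C.remove(removable[0])
            steps.append((-1, removable[0]))
    return steps, t_a

# Hypothetical toy graph: triangles {0,1,2} and {3,4,5} joined by edge 2-3.
ADJ = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3},
       3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}

steps, t_final = uniform_growth(ADJ, 0)
order = [w for _, w in steps]
```

Starting from vertex 0, the process first exhausts the seed's triangle and only then crosses the bridge; crossing it is the step that raises $t_a$ to its final value.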



3. Extracting the communities in three stages 

The previous section described the growth process, which outputs a sequence $C_r =
v + \sum_{i=1}^{r} s_i w_i$. Some vertices of the graph may be inserted, removed and later reinserted
during this process. So as a first step we filter the sequence to generate a new one which
only keeps the last insertion of each vertex. In this way we obtain a subsequence $S$
of the original one, such that each vertex appears once and only once throughout it.
Now, as the growth process tends to choose the vertices by their strong linkage to the
natural community built so far, we state that two consecutive vertices in the sequence
either belong to the same community or are border vertices. Considering that
the first case is the most frequent, an algorithm is needed in order to cut that sequence
into communities. This section presents our approach in three stages to obtain the
final partition of the graph. Briefly, the first stage turns the sequence of vertices into a
sequence of communities. It makes use of a division criterion defined by a function $R(v)$
in order to decide if a vertex $v$ will stay in the same community as the previous vertex
in the sequence or will start a new community. The second stage joins consecutive
communities in order to improve the community structure, and the last stage moves
individual vertices from one community to another.

3.1. Stage One: Making cuts in the process

In this first stage we divide the sequence $S$ to obtain a list of communities $\mathcal{C} =
(C_1, C_2, \dots, C_M)$. These communities are composed of vertices which are consecutive
in the sequence. The cuts are made by observing the behavior of the function $R(w)$
given in equation (4): the number of neighbors of $w$ inside $S(w)$ minus the number of
neighbors of $w$ outside $S(w)$, divided by their sum. Here $S(w)$ is the sublist of $S$
from the first vertex in the sequence up to $w$.

Figure 2 sheds some light on why this function is useful to identify
"subcommunities", i.e., elementary groups which will later take part in the final
communities.

In fact, what happens is that when the process leaves a subcommunity of
strongly connected vertices and adds any vertex from outside, there is a decay in
the function value, due to the relatively scarce number of connections between the
subcommunity and the new vertex. Figure 1, obtained by processing the dolphins
network [Lusseau and Newman, 2004], shows a clear decay in position 36 when the
process jumps between the two known communities [Newman and Girvan, 2004].

The $R(v)$ function cuts the sequence whenever it finds a minimum value which
is smaller than the last minimum. This indicates that we have reached a valley
between two bellies of the curve, which belongs to an inter-community area. This is
quite an aggressive criterion, as sometimes frontier vertices may produce unnecessary
cuts. This does not represent a problem, because these small communities taken from
the border will be joined to their actual communities during the next stages. This is
the case of the vertices in positions 36, 39 and 54 in Figure 2. This figure illustrates the
three stages for the dolphins network.
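
The decay of $R$ at a community border can be reproduced on a small example. The sketch below only computes $R(w)$ along a given sequence and reports its interior local minima as candidate cut positions; the further comparison against the last minimum is not reproduced, since the text leaves its details open. The sequence, the graph `ADJ` and the function names are hypothetical.

```python
from fractions import Fraction

def r_values(adj, sequence):
    """R(w) = (ki - ko) / (ki + ko) with respect to the prefix S(w)."""
    seen = set()
    values = []
    for w in sequence:
        seen.add(w)                     # S(w): prefix up to and including w
        ki = len(adj[w] & seen)         # neighbours of w inside the prefix
        ko = len(adj[w]) - ki           # neighbours of w still outside
        values.append(Fraction(ki - ko, ki + ko))   # assumes deg(w) > 0
    return values

def interior_local_minima(values):
    """Positions whose value is below the previous one and not above the next."""
    return [i for i in range(1, len(values) - 1)
            if values[i] < values[i - 1] and values[i] <= values[i + 1]]

# Hypothetical toy graph: triangles {0,1,2} and {3,4,5} joined by edge 2-3.
ADJ = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3},
       3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}

R = r_values(ADJ, [0, 1, 2, 3, 4, 5])
dips = interior_local_minima(R)
```

Along this sequence $R$ rises inside the first triangle, drops when vertex 3 (the first vertex of the second triangle) is inserted, and rises again, so the only interior minimum is at position 3.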

3.2. Stage Two: Joining successive sets to get communities 

In this step we join consecutive subcommunities $(C_i, C_{i+1})$ from stage 1, based on the
following criterion: when $cut(C_i, C_{i+1}) > ski(C_i)$ or $cut(C_i, C_{i+1}) > ski(C_{i+1})$ (which
means that the subcommunity has more connections to the other one than to itself),
the subcommunities are merged and form a new community $C'_i$. The step finishes
when no more consecutive subcommunities can be joined.
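
A sketch of this merging step is given below. It follows the definitions of Section 2.1 literally, so the cut counts each connecting edge once (as pairs $(v,w)$ with $v \in C_i$, $w \in C_{i+1}$) while $ski$ counts each internal edge twice. Merging the first qualifying pair from left to right is an assumption of the sketch, as are the input list and the graph `ADJ`.

```python
def cut_size(adj, A, B):
    """Number of pairs (v, w) in E with v in A and w in B (A, B disjoint)."""
    return sum(len(adj[v] & B) for v in A)

def ski(adj, C):
    """ski(C): sum over v in C of the neighbours of v inside C."""
    return sum(len(adj[v] & C) for v in C)

def merge_consecutive(adj, communities):
    """Repeatedly merge neighbouring list entries that satisfy the criterion."""
    comms = [set(c) for c in communities]
    merged = True
    while merged:
        merged = False
        for i in range(len(comms) - 1):
            a, b = comms[i], comms[i + 1]
            c = cut_size(adj, a, b)
            if c > ski(adj, a) or c > ski(adj, b):
                comms[i:i + 2] = [a | b]   # replace the pair by its union
                merged = True
                break                      # rescan from the left (assumption)
    return comms

# Hypothetical toy graph: triangles {0,1,2} and {3,4,5} joined by edge 2-3.
ADJ = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3},
       3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}

# Vertex 2 was cut off as a one-vertex subcommunity in this example input.
result = merge_consecutive(ADJ, [{0, 1}, {2}, {3, 4, 5}])
```

The isolated border vertex 2 has no internal connections, so it is absorbed by its left neighbor, whereas the two triangles remain separate because their single connecting edge does not exceed either $ski$ value.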

3.3. Stage Three: Reclassifying vertices 

In order to correct the possible errors of the fitness growth process, we apply this last
step, which is similar to the previous one, but with a vertex granularity: if any vertex $w$
has more connections to some other community $C_j$ than to the one it belongs to, then
the vertex is moved to $C_j$. When this stage finishes, every vertex is more attached to
its own community than to any other, which is quite a strong condition on community
membership.

$$R(w) = \frac{ki_{S(w)}(w) - ko_{S(w)}(w)}{ki_{S(w)}(w) + ko_{S(w)}(w)} \quad (4)$$

[Figure 1: plot of $R(w)$ against the index of each vertex $v$ in the growth process (horizontal axis from 10 to 60).]

Figure 1. The cuts in the growth process for the dolphins social network
[Lusseau and Newman, 2004]. The cut vertices (in black) are: 44, 36, 3, 0, 39, 7,
1, 41, 57.

We sweep over all the vertices looking for misclassified ones, and when no vertex can
be moved the algorithm stops. We have observed a fast convergence and stabilization of
this stage in all the test networks that we used. During the first sweep, most vertices
move to their right community, and in the second and third sweeps the number of moved
vertices sharply decreases.
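
The reclassification sweep can be sketched as follows. The bound `max_sweeps`, the order in which vertices are visited and the choice of the most connected community as the destination are assumptions of the sketch; the graph `ADJ` and the initial partition are hypothetical.

```python
def reclassify(adj, communities, max_sweeps=100):
    """Move vertices to a more connected community until a sweep moves none."""
    label = {v: i for i, com in enumerate(communities) for v in com}
    for _ in range(max_sweeps):            # safety bound (an assumption)
        moved = False
        for v in sorted(label):
            links = {}                     # community index -> edges from v
            for u in adj[v]:
                links[label[u]] = links.get(label[u], 0) + 1
            own = links.get(label[v], 0)
            best = max(links, key=links.get, default=label[v])
            if links.get(best, 0) > own:   # strictly more links elsewhere
                label[v] = best
                moved = True
        if not moved:
            break
    groups = {}
    for v, i in label.items():
        groups.setdefault(i, set()).add(v)
    return list(groups.values())

# Hypothetical toy graph: triangles {0,1,2} and {3,4,5} joined by edge 2-3.
ADJ = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3},
       3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}

# Vertex 2 starts in the wrong community in this example input.
partition = reclassify(ADJ, [{0, 1}, {2, 3, 4, 5}])
```

In the first sweep vertex 2 has two links to the first community and one to its own, so it is moved; the second sweep finds nothing to move and the procedure stops.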

4. Algorithmic Complexity 

In this section we provide complexity bounds for the growth process and for the three
stages. We shall use the notation $N(v)$ for the neighborhood of $v$ (the set of vertices
which have an edge with $v$). Similarly, $N(C)$ will denote the set of communities whose
vertices have at least one neighbor in $C$. Finally, we call $d_{max} = \max\{\deg(v), v \in V\}$.

Growth process. The growth process is a sequence of vertex insertions interleaved
with some eliminations. During all our experiments, we verified that the eliminations are
scarce and that they do not affect the order of complexity of the process. So we shall analyze
the complexity for a growing process with no eliminations, such that the community
size grows by one vertex at each step, from 1 to $n$. Let us consider step $k$: we must analyze
the inclusion of all the community neighbors, that is, all the vertices outside $C$ which
have some neighbor in $C$; as $k$ vertices are inside $C$, the outsiders can be bounded by
$n - k$. For each of them we evaluate $t_c(C, w)$. This implies computing $\Delta m_V$ and $\Delta m_E$:
$\Delta m_V$ comes from the vertex degree, while $\Delta m_E$ is related to $ki_C$ and $ko_C$. So this
computation is direct and does not depend on the size of the network. The minimum
$t_c(C, w_i)$ wins and $w_i$ is inserted into the community $C$. The last step consists of
updating the $ki$ and $ko$ for the neighbors of $w$, and for $w$ itself. For each of them we
shall increase $ki$ by 1 and decrease $ko$ by the same amount. The complexity of this last
step is then $|N(w)| + 1$.

Expanding the analysis for step $k$ to the whole process, we get
$\sum_{k=1}^{n} \left[ (n - k) + |N(w)| + 1 \right] < n^2 + n \cdot d_{max} + n$.
This makes a complexity of $O(n^2)$.

Figure 2. The three stages of the algorithm in the dolphins network. The vertices
were positioned according to their communities after the third stage. Picture generated
with the igraph package for R [Csardi and Nepusz, 2006]. The picture for the first
stage matches the cuts in Figure 1 (from left to right) in the following way (initial
vertex, color and shape): 12, dark gray circles; 44, white circles; 36, light gray circles;
3, black circles; 0, white rectangles; 39, gray rectangles; 7, dark gray rectangles; 1,
black rectangles; 41, light gray rectangles; 57, gray circles.

Stage 1. In the cutting algorithm the process is run through only once, from the
beginning to the end, and for vertex $v_i$ the cut decision is made based on $R(v_{i-1})$, $R(v_i)$
and $R(v_{i+1})$, where $i$ refers to the position of the vertex in the growth process. The
complexity here is $O(n)$.

Stage 2. For the merge of communities which are consecutive in the process, we
need a matrix with all the cuts $cut(C_i, C_j)$, and also the values of $ski$ and $sko$ for
each community. In order to precompute all this, we must consider each edge in the
network, so it has a cost of $O(m)$, and requires a memory of $O(|\mathcal{C}|^2)$ (in order to build
the adjacency matrix of communities). Now, after building this structure, we start
merging consecutive communities. We can bound the number of merges by $|\mathcal{C}|$, and
for each merge we analyze all the possibilities, i.e., all the pairs $(C_i, C_{i+1})$, which total
$|\mathcal{C}| - 1$. Evaluating the convenience of joining $C_i$ and $C_{i+1}$ is $O(1)$, as it only involves
the pre-computed values of $ski$ and $sko$. So the selection of the best merge is $O(|\mathcal{C}|)$.
Finally, the update of the cuts $cut(C_i, C_j)$ for the neighbor communities of both implies
$|N(C_i)|$ accesses to the matrix. Updating the values of $ski$ and $sko$ is immediate. In
conclusion, the merge complexity is $O(|\mathcal{C}|)$ and the number of merges is bounded by
$|\mathcal{C}|$. As $|\mathcal{C}|$ is bounded by $n$, the cost of stage 2 is $O(n^2)$.

Stage 3. Here we analyze each pair $(v, C)$, where $C$ is a community such that its
vertices have one or more links to $v$. In order to decide if we move $v$ to $C$, we use an
ordered record of the cuts $cut(v, C)$. Building the record at the beginning costs $O(m)$,
just as in Stage 2. Then, we analyze all vertices ($O(n)$) to find the best community for
each of them, and if we move the vertex, we must update the record, with a cost of
$\deg(v)$. Now, this makes a complexity of $O(m + A \cdot n \cdot \deg(v))$, where $A$ is the number
of traverses over all the vertices. Bounding this number with a fixed value (based on
empirical observations), the complexity is also $O(n^2)$.

5. Results and Data Analysis 

In this section we exhibit the results of our local method, applying it to (i) a benchmark
of heterogeneous networks, (ii) real networks of different sizes, and (iii) random networks.
We give a brief explanation of mutual information as a metric in 5.1, and in 5.3
we propose a correlation-based measure which shall be useful to understand the limits
of global methods. Finally we show that the algorithm is robust for large networks with
a well-defined community structure.



5.1. Mutual Information 

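For the purpose of comparing different community structures, we used the normalized
mutual information [Danon et al., 2005]. In order to define it in terms of random
variables, we consider the following process: we pick a vertex $v$ at random from $V$
with a uniform distribution, and define the variable $X$ related to partition $\mathcal{C}_1$. This
variable assigns to each vertex the subindex of the community it belongs to. Clearly,
the distribution of $X$ is

$$P[X = i] = p_i = \frac{|C^1_i|}{|V|} , \quad (5)$$

where $i = 1, 2, \dots, |\mathcal{C}_1|$ and $C^1_i$ is the $i$-th community of $\mathcal{C}_1$. The entropy of $\mathcal{C}_1$ can now be defined as:

$$H(\mathcal{C}_1) = -\sum_{i=1}^{|\mathcal{C}_1|} p_i \cdot \log(p_i) . \quad (6)$$

If we introduce a second partition $\mathcal{C}_2$ with its related variable $Y$ under the same
process, then the joint distribution for $X, Y$ is

$$P[X = i, Y = j] = p_{ij} = \frac{|C^1_i \cap C^2_j|}{|V|} , \quad (7)$$

where $i = 1, 2, \dots, |\mathcal{C}_1|$, $j = 1, 2, \dots, |\mathcal{C}_2|$. In these terms, the normalized mutual
information is expressed as:

$$NMI(\mathcal{C}_1, \mathcal{C}_2) = \frac{-2 \sum_{i=1}^{|\mathcal{C}_1|} \sum_{j=1}^{|\mathcal{C}_2|} p_{ij} \cdot \log\left(\frac{p_{ij}}{p_i \cdot p_j}\right)}{\sum_{i=1}^{|\mathcal{C}_1|} p_i \cdot \log(p_i) + \sum_{j=1}^{|\mathcal{C}_2|} p_j \cdot \log(p_j)} , \quad (8)$$

where $p_j = P[Y = j]$ and $\sum_{i=1}^{|\mathcal{C}_1|} \sum_{j=1}^{|\mathcal{C}_2|} p_{ij} \cdot \log\left(\frac{p_{ij}}{p_i \cdot p_j}\right) = MI(\mathcal{C}_1, \mathcal{C}_2)$ is the mutual information. The following
equality holds:

$$MI(\mathcal{C}_1, \mathcal{C}_2) = H(\mathcal{C}_1) + H(\mathcal{C}_2) - H(\mathcal{C}_1, \mathcal{C}_2) , \quad (9)$$

where $H(\mathcal{C}_1, \mathcal{C}_2)$ is the joint entropy. $NMI(\mathcal{C}_1, \mathcal{C}_2)$ falls between 0 and 1, and gives an
idea of the similarity between partitions in terms of information theory, i.e., in terms
of the information about $\mathcal{C}_1$ that lies in $\mathcal{C}_2$, or vice versa.

The inherent idea is that a partition $\mathcal{C}$ of a graph gives us some information relative
to the classification of vertices into groups. This amount of information is measured by
its entropy, $H(\mathcal{C})$.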

In fact, the denominator in $NMI(\mathcal{C}_1, \mathcal{C}_2)$ together with the $-2$ constant represent
a normalization by the average entropy of the partitions, $\frac{H(\mathcal{C}_1) + H(\mathcal{C}_2)}{2}$. A normalized
mutual information of 1 implies that the partitions are coincident.
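
As an illustration, the normalized mutual information of equation (8) can be computed from two label vectors (one community label per vertex). The function name `nmi`, the label-vector representation and the convention of returning 1 when both partitions are trivial are assumptions of this sketch.

```python
import math
from collections import Counter

def nmi(labels_a, labels_b):
    """Normalized mutual information 2 MI / (H(A) + H(B)) of two partitions."""
    n = len(labels_a)
    count_a = Counter(labels_a)                     # community sizes in A
    count_b = Counter(labels_b)                     # community sizes in B
    count_ab = Counter(zip(labels_a, labels_b))     # sizes of intersections
    h_a = -sum(c / n * math.log(c / n) for c in count_a.values())
    h_b = -sum(c / n * math.log(c / n) for c in count_b.values())
    mi = sum(c / n * math.log((c / n) / ((count_a[i] / n) * (count_b[j] / n)))
             for (i, j), c in count_ab.items())
    if h_a + h_b == 0:
        return 1.0      # both partitions have a single community (convention)
    return 2.0 * mi / (h_a + h_b)

same = nmi([0, 0, 0, 1, 1, 1], [1, 1, 1, 0, 0, 0])    # same partition, relabeled
independent = nmi([0, 0, 1, 1], [0, 1, 0, 1])         # labels carry no information
```

The measure depends only on the grouping and not on the label names, so the first pair scores 1 (up to rounding), while the second pair, whose joint distribution factorizes, scores 0.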

5.1.1. Normalizations and triangular inequalities. We remark that other normalizations
of the mutual information also exist, like:

$$NMI_2(\mathcal{C}_1, \mathcal{C}_2) = \frac{MI(\mathcal{C}_1, \mathcal{C}_2)}{H(\mathcal{C}_1, \mathcal{C}_2)} , \quad (10)$$

which has the advantage that $1 - NMI_2$ is a metric [Vinh et al., 2009]. Although we
consider it more correct to use this normalization, we shall hold to the first one for the
purpose of comparison with other works in the literature. Anyway, we were able to find
a transitivity property on $NMI$ too (we shall call it $NMI_1$ here). In fact, observing
that:

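$$\frac{2}{NMI_1(\mathcal{C}_1, \mathcal{C}_2)} = \frac{H(\mathcal{C}_1) + H(\mathcal{C}_2)}{H(\mathcal{C}_1) + H(\mathcal{C}_2) - H(\mathcal{C}_1, \mathcal{C}_2)} \quad (11)$$

$$\frac{1}{NMI_2(\mathcal{C}_1, \mathcal{C}_2)} = \frac{H(\mathcal{C}_1, \mathcal{C}_2)}{H(\mathcal{C}_1) + H(\mathcal{C}_2) - H(\mathcal{C}_1, \mathcal{C}_2)} \quad (12)$$

we can deduce a functional relationship between these two:

$$\frac{2}{NMI_1(\mathcal{C}_1, \mathcal{C}_2)} - \frac{1}{NMI_2(\mathcal{C}_1, \mathcal{C}_2)} = 1 . \quad (13)$$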

This relationship produces a hyperbola as in Figure 3. The good behavior of the
function around $(1, 1)$ assures that values of $NMI_1$ close to 1 imply values of $NMI_2$
close to 1 too. The transitivity of the metric implies that if $NMI_2(x,y) > 1 - \epsilon$ and
$NMI_2(x,z) > 1 - \epsilon$, then $NMI_2(y,z) > 1 - 2\epsilon$. Then, by the functional relationship,
$NMI_1(y,z)$ will be close to 1 too.

In other words, if $NMI(\mathcal{C}_R, \mathcal{C}_1)$ is high and $NMI(\mathcal{C}_R, \mathcal{C}_2)$ is high, then $NMI(\mathcal{C}_1, \mathcal{C}_2)$
is also high. This result will be used in section 5.4, where $\mathcal{C}_R$ is a reference partition
used to analyze our algorithm's robustness.
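
The closeness claim can be made explicit with a short derivation. It assumes that $NMI_2$ is the mutual information normalized by the joint entropy, $NMI_2 = MI/H(\mathcal{C}_1, \mathcal{C}_2)$, and that $NMI_1$ is the normalization of equation (8); the explicit bound is an illustrative consequence of these assumptions rather than a result stated in the text.

```latex
\begin{align*}
NMI_1 &= \frac{2\,MI}{H(\mathcal{C}_1) + H(\mathcal{C}_2)}
       = \frac{2\,MI}{H(\mathcal{C}_1,\mathcal{C}_2) + MI}
       = \frac{2\,NMI_2}{1 + NMI_2}
       && \text{(using equation (9))} \\
NMI_2(y,z) > 1 - 2\epsilon
  &\;\Longrightarrow\;
  NMI_1(y,z) > \frac{2(1 - 2\epsilon)}{2 - 2\epsilon}
             = \frac{1 - 2\epsilon}{1 - \epsilon}
       && \text{(the map } x \mapsto \tfrac{2x}{1+x} \text{ is increasing)}
\end{align*}
```

For instance, with $\epsilon = 0.05$ the bound gives $NMI_1(y,z) > 0.9/0.95 \approx 0.947$.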



5.2. Benchmarking with a set of heterogeneous networks 

5.2.1. Benchmark description. We evaluated our algorithm with a benchmark proposed
in [Lancichinetti et al., 2008]. We used their software to create sets of 10,000
heterogeneous random graphs, with different power laws for the vertex degree
distribution (exponent $\alpha$) and the community size distribution (exponent $\beta$), as well
as different mixing parameters $\mu$.

We constructed graphs of 1,024 vertices, with $\langle \deg(v) \rangle = 10$ and $d_{max} = 100$. Each
set keeps a fixed value of $\alpha$ and $\beta$, while the mixing parameter $\mu$ moves between 0.05
and 0.50. Thus, it has 1,000 graphs for each $\mu$, making a total of 10,000 graphs.

We built 3 sets, considering representative values of $\alpha$ and $\beta$ in heterogeneous
networks.

• BENCH1: $\alpha = 1.2$, $\beta = 3.0$

• BENCH2: $\alpha = 1.8$, $\beta = 1.2$

• BENCH3: $\alpha = 2.0$, $\beta = 2.0$

[Figure 3: curve plotted against $NMI_2$ on the horizontal axis (0.0 to 1.0).]

Figure 3. Functional relationship between two normalizations of the mutual
information: $NMI_1$ and $NMI_2$.

We also tested other pairings of $\alpha \in [1,3]$ and $\beta \in [1,3]$. BENCH1 turned out to be
the best case, BENCH2 the worst case, and BENCH3 an average case.

We have used this benchmark for different reasons: (a) it simulates real networks by
generating heterogeneous distributions. These distributions provide greater challenges
to the community discovery algorithms than fixed-degree networks like
the ones generated by the GN benchmark [Girvan and Newman, 2002]. For example,
heterogeneous networks are subject to resolution limit problems when global methods
are applied; (b) the parameters adjust tightly to the proposed values, the $\mu$ distribution
following a roughly bell-shaped curve around the desired $\mu$; and (c) it has a low
complexity, which makes it suitable to generate a big set of graphs.

5.2.2. Obtained results As explained in section 3, the uniform growth process returns
an ordered list of vertices, such that either two consecutive vertices are neighbors in
the same community, or else each of them belongs to the border of its community. Only
after the first stage do we obtain a partition that we can compare with the
original one. Figure 4 analyzes the results of the three stages as a function of μ,
which is the most decisive parameter for community detection. It displays
the mutual information between our partition and the one issued from the benchmark
after the end of each stage. We used the boxplot command of the R statistical
software [R Development Core Team, 2008]. This command computes the quartiles for
each μ, displaying the median (second quartile), boxes spanning the first and third
quartiles, and whiskers placed at the extremes of the data. The plot in the upper
left corner analyzes BENCH3 and shows only the medians for the three stages at the
same time, for comparison purposes. The other plots are boxplots comparing BENCH1
and BENCH2.
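As a rough illustration of what each box conveys, the sketch below computes the minimum, the three quartiles and the maximum of a sample with Python's `statistics` module. The NMI values are made-up placeholders, not results from the benchmark, and the "inclusive" quartile convention is an assumption that may differ slightly from the hinges computed by R.

```python
from statistics import quantiles


def five_number_summary(values):
    """Minimum, first quartile, median, third quartile and maximum."""
    q1, q2, q3 = quantiles(values, n=4, method="inclusive")
    return min(values), q1, q2, q3, max(values)


# Hypothetical NMI values for the networks generated with one value of mu.
nmi_by_mu = {
    0.05: [0.98, 0.99, 1.00, 0.97, 1.00],
    0.50: [0.62, 0.71, 0.55, 0.80, 0.68],
}
for mu, values in sorted(nmi_by_mu.items()):
    print(mu, five_number_summary(values))
```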

We observe that the results after the first stage on BENCH1 and BENCH3 are successful
for a wide range of values of μ, where the mutual information is larger than 0.9. BENCH2
represents the worst case, and greater values of μ make the mutual information decrease
substantially. This is a typical behavior, and one of the reasons is that the first stage
cuts the ordered list into sets every time it reaches a community border; as the
borders are very fuzzy for large values of μ, communities are sometimes split in two or
more. The second stage corrects this problem, improving the previous result by
about 3%, and is more effective for lower values of μ. Finally, the third
stage yields a considerable gain in general, even for large values of μ. In fact, the mutual
information improves by more than 10% in the interval μ ∈ [0.3, 0.5]. In the case of BENCH2
and μ = 0.5, the third stage improves the median but widens the range of values of the
mutual information, which reaches a minimum value of 0.2.

Figure 4. Statistical analysis of the normalized mutual information between our
partition and the communities known a priori, after each of the three stages of the
community detection algorithm. These are results for BENCH1, BENCH2 and BENCH3,
each of them consisting of 1,000 networks for each value of μ, whose values range from
0.05 to 0.50. The plot in the upper-left corner is for BENCH3, and represents median
values of mutual information after each of the three stages. Each of the other plots
compares BENCH1 (white) and BENCH2 (gray) for a different stage. μ varies from 0.05
to 0.5 in steps of 0.05, but the boxplots are interlaced over the x-axis just for the sake
of clarity.

5.2.3. A comparison with a modularity-based method Figure 5 compares the partitions
found with our growth process based on the H fitness function with those found by a
modularity-based algorithm. We chose the Louvain algorithm [Blondel et al., 2008], which is one
of the most efficient modularity-based methods. The points represent median values
for the 1,000 different networks in benchmarks BENCH1 and BENCH2, varying the mixing
parameter μ. The reference partition is the one computed a priori by Lancichinetti's
benchmark, from which the networks are generated. So when we mention the mutual
information for the growth process we mean the mutual information against the
precomputed communities. The same holds for the mutual information for the Louvain
algorithm.
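For readers who want to reproduce this kind of comparison, the sketch below computes a normalized mutual information between two labelings of the same vertex set. It assumes the common normalization 2 I(A;B) / (H(A) + H(B)); the normalizations actually used in this paper are defined in section 5.1 and may differ, so this is an illustration rather than the exact measure plotted in the figures.

```python
from collections import Counter
from math import log


def normalized_mutual_information(labels_a, labels_b):
    """NMI = 2 I(A;B) / (H(A) + H(B)) for two labelings of the same vertices."""
    n = len(labels_a)
    if n != len(labels_b):
        raise ValueError("both labelings must cover the same vertices")
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    count_ab = Counter(zip(labels_a, labels_b))

    def entropy(counts):
        return -sum(c / n * log(c / n) for c in counts.values())

    h_a, h_b = entropy(count_a), entropy(count_b)
    mutual = sum(
        c / n * log(c * n / (count_a[a] * count_b[b]))
        for (a, b), c in count_ab.items()
    )
    if h_a + h_b == 0:
        return 1.0  # both partitions consist of a single community
    return 2 * mutual / (h_a + h_b)


reference = [0, 0, 0, 0, 1, 1, 1, 1]
renamed = [7, 7, 7, 7, 3, 3, 3, 3]      # same partition, different labels
unrelated = [0, 0, 1, 1, 0, 0, 1, 1]    # statistically independent of reference

print(normalized_mutual_information(reference, renamed))
print(normalized_mutual_information(reference, unrelated))
```

The measure depends only on how vertices are grouped, not on the labels themselves, which is why the reference partition and the detected one can be compared directly.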

We observe that our growth process represents a general improvement for the
detection of communities in the benchmarks, and that the difference in performance
increases for higher values of the mixing parameter μ. This behavior is discussed in
the next subsection.

Figure 5. Comparison between our growth process and the Louvain modularity-based
method. We consider the communities generated a priori by Lancichinetti's
benchmark, and we use them as a reference partition for the comparison. The picture
compares the mutual information for our growth process and for the Louvain method.
The points represent median values for the 1,000 networks generated for each different
μ. (a) On the left, results for BENCH1: α = 1.2, β = 3.0. (b) On the right, results for
BENCH2: α = 1.8, β = 1.2.



5.3. A correlation-based measure 

Let C_i, 1 ≤ i ≤ k, be a partition of V. Consider the following random variables: select a
pair (v, w) from E at random and define L_i as a Bernoulli variable such that L_i = 1 if
v ∈ C_i. In the same way, we define R_i as a Bernoulli variable such that R_i = 1 if w ∈ C_i.
Thus, it follows that P(L_i = 1) = P(R_i = 1) = m_V(C_i). If C_i is a community, we expect
that P(R_i = 1 | L_i = 1) > P(R_i = 1), thus a sensible measure of the community quality
is the correlation ρ_ii, where

\[
\rho_{ij} = \rho(L_i, R_j)
= \frac{m_E(C_i \times C_j) - m_V(C_i)\, m_V(C_j)}
       {\sqrt{m_V(C_i)\, m_V(C_j)\, \bigl(1 - m_V(C_i)\bigr)\, \bigl(1 - m_V(C_j)\bigr)}} .
\]



Notice also that ρ_ij > 0 means that joining C_i to C_j will give an increment in the
usual Newman modularity Q, and that ρ_ii > 0 means that

\[
P(R_i = 1 \mid L_i = 1) > P(R_i = 1)
\]

as expected. In [Busch et al., 2010] the authors have studied the relationship between
these coefficients and modularity maximization, and when ρ_ij > 0 they say that
C_i and C_j are not mutually submodular. This simply means that this pair of communities
would usually be joined by agglomerative modularity maximization techniques, because
their union increases modularity.
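The link between the sign of the correlation and the modularity increment can be made explicit. The derivation below is a sketch under assumptions suggested by the text but not restated in this section: m_E(A × B) is the fraction of ordered pairs (v, w) of E with v ∈ A and w ∈ B (so it is symmetric and additive), m_V(C) is the fraction of edge endpoints lying in C, and the modularity is written as Q = Σ_i [m_E(C_i × C_i) − m_V(C_i)²].

```latex
\begin{align*}
\Delta Q_{ij}
 &= \Bigl[m_E\bigl((C_i \cup C_j)\times(C_i \cup C_j)\bigr) - m_V(C_i \cup C_j)^2\Bigr]
  - \Bigl[m_E(C_i \times C_i) - m_V(C_i)^2\Bigr]
  - \Bigl[m_E(C_j \times C_j) - m_V(C_j)^2\Bigr] \\
 &= 2\, m_E(C_i \times C_j) - 2\, m_V(C_i)\, m_V(C_j) \\
 &= 2\,\bigl[m_E(C_i \times C_j) - m_V(C_i)\, m_V(C_j)\bigr].
\end{align*}
```

The bracket in the last line is the numerator of the correlation between L_i and R_j, and the denominator of that correlation is positive, so under these assumptions the modularity increment of joining the two communities has the same sign as their correlation.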

Figure 6 depicts the values of the correlation for all the pairs (C_i, C_j) in one of
the instances of BENCH2 with μ = 0.30. The partition that we considered here is the
one set a priori by the benchmark algorithm. We found 82 pairs of communities
(C_i, C_j), i ≠ j, that are not submodular (i.e., ρ_ij > 0). The communities in these pairs will not be
detected by modularity-based techniques, and this fact might explain why our fitness
growth function can outperform them, when the real communities do not fulfill what we
call the submodular condition. On the other hand, all the negative correlations are very
close to zero, indicating that most of the pairwise unions would not produce a significant
change in the modularity functional. This fact is in accordance with the observation
in [Good et al., 2010] that high-modularity partitions are prone to extreme degeneracy.
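A count of non-submodular pairs like the one reported for Figure 6 can be obtained with a few lines of Python. This is a minimal sketch, not the authors' code: it assumes that (v, w) is drawn uniformly from the ordered pairs of an undirected edge list, so that m_V(C) is the fraction of edge endpoints in C and m_E(A × B) is the fraction of ordered pairs in A × B. The function names and the toy graph are hypothetical.

```python
from itertools import combinations
from math import sqrt


def correlation_matrix(edges, community):
    """Return (labels, rho), where rho[a][b] is the correlation between
    'first endpoint in community a' and 'second endpoint in community b'.

    edges: list of undirected edges (v, w), each listed once.
    community: dict mapping vertex -> community label.
    Assumes at least two communities, each containing some edge endpoint.
    """
    labels = sorted(set(community.values()))
    # Each undirected edge contributes the ordered pairs (v, w) and (w, v).
    ordered = [(v, w) for v, w in edges] + [(w, v) for v, w in edges]
    total = len(ordered)
    first_in = {c: 0 for c in labels}                      # counts for m_V
    pair_in = {a: {b: 0 for b in labels} for a in labels}  # counts for m_E
    for v, w in ordered:
        first_in[community[v]] += 1
        pair_in[community[v]][community[w]] += 1
    rho = {a: {} for a in labels}
    for a in labels:
        for b in labels:
            m_a, m_b = first_in[a] / total, first_in[b] / total
            cov = pair_in[a][b] / total - m_a * m_b
            rho[a][b] = cov / sqrt(m_a * (1 - m_a) * m_b * (1 - m_b))
    return labels, rho


def non_submodular_pairs(labels, rho):
    """Pairs of distinct communities with positive correlation."""
    return [(a, b) for a, b in combinations(labels, 2) if rho[a][b] > 0]


# Toy graph: a triangle {0, 1, 2} artificially split into A = {0, 1} and
# B = {2}, attached by the edge 2-3 to a 4-clique C = {3, 4, 5, 6}.
edges = [(0, 1), (0, 2), (1, 2), (2, 3),
         (3, 4), (3, 5), (3, 6), (4, 5), (4, 6), (5, 6)]
community = {0: "A", 1: "A", 2: "B", 3: "C", 4: "C", 5: "C", 6: "C"}

labels, rho = correlation_matrix(edges, community)
print(non_submodular_pairs(labels, rho))
```

In the toy graph only the two halves of the split triangle are positively correlated, which is the situation in which an agglomerative modularity method would merge them.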

In Figure 7 we analyze the existence of non-submodular communities for BENCH2.
The y-axis represents the percentage of non-submodular pairs (C_i, C_j), i ≠ j. For each μ,
the boxes represent the 1,000 network instances with that μ. The left plot corresponds
to Lancichinetti's a priori partition, while the right plot is for the communities that we
obtain. The linear behavior of the percentage as a function of μ explains why
modularity-based techniques tend to fail when the values of μ are larger. In fact, in the Louvain
algorithm the communities are merged until the condition ρ_ij ≤ 0 is achieved for all pairs.

5.4. Robustness analysis

In order to study the robustness of our method in real networks where the actual
communities are generally unknown, we propose to analyze the mutual information
between different partitions starting from randomly chosen vertices, and observe the
repeatability of the results. The studied networks include the karate club network
[Zachary, 1977], the bottlenose dolphins network [Lusseau and Newman, 2004], the
American college football network in [Girvan and Newman, 2002], an e-mail interchange
network [Guimerà et al., 2003], Erdős-Rényi random graphs ER* [Erdős and Rényi, 1959],
an instance from the BENCH3 benchmark with μ = 0.40 (see section 5.2.1), a portion
of arXiv [Cornell KDD Cup, 2003], a collaboration network in Condensed Matter
ConMat [Girvan and Newman, 2002], and a portion of the World Wide Web network
WWW [Albert et al., 1999]. Table 1 shows the sizes of these networks.

Figure 6. Matrix of correlations ρ_ij for the communities set a priori in one of the
instances of BENCH2 with μ = 0.30. We find that 82 pairs (C_i, C_j) outside the diagonal
are not submodular (ρ_ij > 0).

Figure 7. Boxplots representing the percentage of non-submodular community pairs
(C_i, C_j), i ≠ j (where ρ_ij > 0) for the 10,000 instances in BENCH2, as a function of
μ. (a) Lancichinetti's a priori communities. (b) Communities obtained by our fitness
growth process.

It is a remarkable fact that the original (a priori) communities are not submodular
or, in other words, that the benchmark generates partitions for which modularity
optimization techniques would tend to fail. We also point out that a similar plot for
the partitions obtained by the Louvain algorithm would show a constant zero for the
percentage of non-submodular pairs. This necessarily holds for any agglomerative
modularity maximization technique which attains a local maximum.

Table 1. Summary of results for the analyzed networks. The columns represent:
network size (number of vertices and edges), average number of communities found
with the Fitness Growth Process and standard deviation, and the number of modules
discovered by the Louvain algorithm.

Figure 8 shows the boxplots, together with the density functions, of the mutual
information for each network. In each of them we picked a random vertex, ran the
algorithm, and took the resulting partition as the reference partition. Then we started
the algorithm from other vertices, and measured the mutual information between these
partitions and the reference partition. In small networks we considered all the vertices,
whereas we used just 1,000 different vertices for the arXiv and ConMat networks, and 48
for the WWW network. Comparing against a single reference partition, instead of making
an all-pairs comparison, is justified by the transitivity
relationship that we found in 5.1.1.
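The procedure can be sketched as follows. Here `detect` and `similarity` are hypothetical callables standing in for the community detection algorithm and for the normalized mutual information; neither is the authors' implementation, and the toy stand-ins at the end only exercise the control flow.

```python
import random


def robustness_scores(vertices, detect, similarity, samples=None, seed=0):
    """Compare partitions grown from several starting vertices with one
    reference partition and return the list of similarity scores.

    detect(start) -> partition obtained when starting from `start`.
    similarity(p, q) -> score in [0, 1], e.g. a normalized mutual information.
    samples: number of starting vertices to try besides the reference one;
             None means all the remaining vertices.
    """
    rng = random.Random(seed)
    vertices = list(vertices)
    reference_start = rng.choice(vertices)
    reference = detect(reference_start)
    others = [v for v in vertices if v != reference_start]
    if samples is not None:
        others = rng.sample(others, min(samples, len(others)))
    return [similarity(reference, detect(v)) for v in others]


# Toy stand-ins: a detector that always returns the same partition and a
# similarity that counts vertices with matching labels.
fixed_partition = {v: v // 5 for v in range(10)}


def toy_detect(start):
    return dict(fixed_partition)


def toy_similarity(p, q):
    return sum(p[v] == q[v] for v in p) / len(p)


scores = robustness_scores(range(10), toy_detect, toy_similarity, samples=4)
print(scores)
```

A narrow spread of the returned scores corresponds to the narrow boxplots discussed for the large networks, while a wide spread corresponds to the behavior of the random graphs.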

The first observation from Figure 8 is that the [Erdős and Rényi, 1959] random
graphs (ER100, ER1k, ER10k) give a wide range of values of mutual information
when the robustness analysis is performed. This is an expected result, as it is
in accordance with the fact that ER graphs do not have a community structure,
as [Lancichinetti and Fortunato, 2011] points out. In fact, the number of communities
found is also very variable (see Table 1), varying from 1 to 1893.

The e-mail case is also remarkable because the mutual information yields a wide
range of values; this fact points to a probably poor community structure in this
network. The other networks present high values of mutual information with small
dispersions (i.e., boxplots are quite narrow). This trend is even more noticeable for
the large networks. In fact, the WWW is an interesting case because all the mutual
information values that we found lie around its median value of 0.989 with extremes
at 0.989 ± 0.02, which means, by transitivity, that the different partitions found when
starting the process from different vertices are quite similar to one another.

Figure 8. Boxplots (with density) representing the results for different real networks
and some Erdős-Rényi random graphs. The networks are spread over the x-axis. The
boxplots and densities show the mutual information between the partitions obtained
when starting from different vertices and a reference partition.



5.5. Application to a collaboration network 

Finally, we applied our algorithm to a network of coauthorships from the Condensed
Matter E-Print Archive. We analyzed the giant component of the network, composed
of 36,458 vertices and 171,736 edges. The result was a partition with 4,425 communities,
whose distribution follows a power law on the community size (see Figure 9.a), which
may be due to the self-similarity of the network [Song et al., 2005]. We remark the
strong agreement between the exponents of both distributions.

While the biggest community in this network contains about 31% of the graph
edges (53,880 internal connections), it only has 406 vertices (1.1%). Evidently, this
community has a strong cohesion.
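These proportions, and the internal edge density they imply, can be checked directly from the figures quoted above. The density computation assumes a simple graph (no repeated coauthorship edges), which the text does not state explicitly; under that assumption the community contains close to two thirds of all the edges it could possibly have.

```python
# Figures quoted in the text for the giant component and its biggest community.
vertices_total, edges_total = 36_458, 171_736
community_vertices, community_edges = 406, 53_880

edge_share = community_edges / edges_total
vertex_share = community_vertices / vertices_total

# Number of vertex pairs inside the community (simple graph assumed).
possible_pairs = community_vertices * (community_vertices - 1) // 2
internal_density = community_edges / possible_pairs

print(f"share of edges:    {edge_share:.1%}")
print(f"share of vertices: {vertex_share:.1%}")
print(f"internal density:  {internal_density:.3f}")
```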

Figure 9.b depicts the density of connections between all pairs of communities C_i
and C_j, in terms of the correlation between two Bernoulli variables defined in 5.3. The
strong correlation in the diagonal implies a high density of edges inside the communities.

Figure 9. (a) Community size and vertex degree distribution for the collaboration
network CondMat. The histograms were built with a log-binning procedure. (b) Edge
density between communities in terms of a correlation between Bernoulli variables, for
the 20 biggest communities in CondMat.



The correlation values close to zero outside the diagonal imply a random amount of 
inter-community edges, similar to the expected amount in a null model graph. 

6. Conclusions 

The work by [Lancichinetti et al., 2009] suggests the possibility of using different fitness
functions for detecting local communities under a general procedure. In this work
we have defined a fitness function H_t and shown that it is essentially equivalent to
the original one, which depends on a resolution parameter α. Then we proved an
important fact: neither parameter (α nor t) plays an important part in
the vertex selection criterion, but only in the termination decision. This means, for
example, that we can obtain a local community C_t for some t, and then build the local
community for t' > t by taking C_t and continuing the process until t'. So we proposed
a unique fitness growth process which finds an ordering of the vertices such that the
different communities lie one after the other. This sequence is the input of a
three-stage algorithm that extracts a community partition of the graph. The algorithm is
freely available to the scientific community as open-source software which can be
downloaded from http://code.google.com/p/commugp/.

We also exploited a benchmark of heterogeneous graphs to test our method. On the one
hand, we tested the correctness of the results by comparing them against communities
defined a priori. On the other hand, we gave an explanation of why global methods tend
to fail on some heterogeneous networks. These ideas were illustrated by the use of a
correlation measure and of normalized mutual information.

Finally, we showed that the method is robust for many real networks. By analyzing
random graphs, we pointed out that the behavior of the method may allow us to
differentiate networks with a strong community structure from randomly connected
ones.

As future work we plan to study different ways of changing the vertex selection
criteria of the growth processes, in order to avoid vertex eliminations. We also intend
to extend the results to detect overlapping communities.